    Adaptive text mining: Inferring structure from sequences

    Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.
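
    As a concrete illustration of the compression principle, the sketch below performs text categorization by asking which class's training corpus a new document extends most cheaply. The paper's techniques rest on adaptive models such as PPM; here the off-the-shelf gzip compressor stands in, and the two tiny training corpora are hypothetical.

        import gzip

        def compressed_size(text):
            """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
            return len(gzip.compress(text.encode("utf-8")))

        def classify(document, corpora):
            """Assign the document to the class whose corpus absorbs it most
            cheaply: minimise size(corpus + document) - size(corpus)."""
            def extra_bytes(corpus):
                return compressed_size(corpus + " " + document) - compressed_size(corpus)
            return min(corpora, key=lambda label: extra_bytes(corpora[label]))

        # Hypothetical training corpora for two classes.
        corpora = {
            "weather": "rain wind cloud sunny forecast storm temperature humidity",
            "finance": "stock market shares interest rate bond dividend profit",
        }
        print(classify("storm clouds and heavy rain expected", corpora))  # expected: weather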

    Digital libraries for the developing world

    Digital libraries (DLs) are the killer app for information technology in developing countries. Priorities there include health, agriculture, nutrition, hygiene, sanitation, and safe drinking water. Computers are not a priority, but simple, reliable access to targeted information meeting these basic needs certainly is. DLs can assist human development by providing a non-commercial mechanism for distributing humanitarian information on precisely these topics. Many other areas, ranging from disaster relief to medical education, and from the preservation and propagation of indigenous culture to educational material that addresses specific community problems, also benefit from new methods of information distribution.

    Customizing digital library interfaces with Greenstone

    Digital libraries are organized, focused collections of information. They are focused on a particular topic or theme—and good digital libraries will articulate the principles governing what is included. They are organized to make information accessible in particular, well-defined, ways—and good ones will include a description of how the information is organized (Lesk, 1997). The Greenstone digital library software is intended to help users construct simple collections of information very quickly. Indeed, only a few minutes of the user's time are needed to set up a collection based on a standard design and initiate the building process. Collections may be large—some comprise Gbytes of text and millions of documents. Furthermore, even larger volumes of information may be associated with a collection—typically audio, image, and video, with textual metadata. Once initiated, the mechanical process of building the collection may take from a few moments for a tiny collection to several hours for a multi-Gbyte one—perhaps even a day if it involves many different full-text indexes.

    Creating and customizing digital library collections with the Greenstone Librarian Interface

    The Greenstone digital library software is a comprehensive system for building and distributing digital library collections. It provides a new way of organizing information and publishing it on the Internet. This paper describes how digital library collections can be created and customized with the new Greenstone Librarian Interface. Its basic features allow users to add documents and metadata to collections, create new collections whose structure mirrors that of existing ones, and build collections and put them in place for users to view. More advanced users can design and customize new collection structures. At the most advanced level, the Librarian Interface gives expert users interactive access to the full power of Greenstone, which could formerly be tapped only by running Perl scripts manually.
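
    The command-line route that the Librarian Interface now automates can be sketched as follows. In Greenstone 2 the usual sequence runs the mkcol.pl, import.pl, and buildcol.pl Perl scripts; the collection name mycol and the creator address below are hypothetical, and exact options vary between versions.

        perl -S mkcol.pl -creator librarian@example.org mycol   # create the skeleton collection
        # ...copy source documents into collect/mycol/import/ ...
        perl -S import.pl mycol     # convert the documents into Greenstone's archive format
        perl -S buildcol.pl mycol   # build the full-text indexes and browsing structures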

    Classification

    In classification learning, an algorithm is presented with a set of classified examples or “instances” from which it is expected to infer a way of classifying unseen instances into one of several “classes”. Instances have a set of features or “attributes” whose values define that particular instance. Numeric prediction, or “regression,” is a variant of classification learning in which the class attribute is numeric rather than categorical. Classification learning is sometimes called supervised because the method operates under supervision by being provided with the actual outcome for each of the training instances. This contrasts with data clustering (see entry Data Clustering), where the classes are not given, and with association learning (see entry Association Learning), which seeks any association – not just one that predicts the class.
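
    A minimal sketch of classification learning in this sense, assuming a one-nearest-neighbour learner over numeric attributes (the toy instances are hypothetical and the method merely illustrative):

        def distance(a, b):
            """Squared Euclidean distance between two attribute vectors."""
            return sum((x - y) ** 2 for x, y in zip(a, b))

        def classify(instance, training):
            """Predict the class of the closest training instance."""
            attributes, label = min(training, key=lambda t: distance(t[0], instance))
            return label

        # Training instances: (attribute vector, class).
        training = [((1.0, 1.1), "yes"), ((0.9, 1.0), "yes"),
                    ((3.0, 3.2), "no"), ((3.1, 2.9), "no")]
        print(classify((1.2, 0.9), training))  # -> yes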

    The Development and Usage of the Greenstone Digital Library Software

    The Greenstone software has helped spread the practical impact of digital library technology throughout the world, particularly in developing countries. This article reviews the project's origins, usage, and the development of support mechanisms for Greenstone users. We begin with a brief summary of salient aspects of this open source software package and its user population. Next we describe how its international, humanitarian focus arose. We then review the special requirements imposed by the conditions that prevail in developing countries. Finally, we discuss efforts to establish regional support organizations for Greenstone in India and Africa.

    Thesaurus-based index term extraction for agricultural documents

    This paper describes a new algorithm for automatically extracting index terms from documents relating to the domain of agriculture. The domain-specific Agrovoc thesaurus developed by the FAO is used both as a controlled vocabulary and as a knowledge base for semantic matching. The automatically assigned terms are evaluated against a manually indexed 200-item sample of the FAO's document repository, and the performance of the new algorithm is compared with a state-of-the-art system for keyphrase extraction.
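
    The matching step at the heart of such an algorithm can be sketched as follows, with a toy controlled vocabulary standing in for Agrovoc and the semantic-matching machinery omitted:

        # Hypothetical vocabulary; Agrovoc itself holds tens of thousands of terms.
        thesaurus = {"soil erosion", "crop rotation", "irrigation", "wheat"}

        def extract_terms(text, max_len=3):
            """Return every word n-gram of the document (n <= max_len)
            that appears verbatim in the controlled vocabulary."""
            words = text.lower().split()
            candidates = {" ".join(words[i:i + n])
                          for n in range(1, max_len + 1)
                          for i in range(len(words) - n + 1)}
            return candidates & thesaurus

        doc = "Crop rotation and careful irrigation reduce soil erosion in wheat fields"
        print(extract_terms(doc))  # -> {'crop rotation', 'irrigation', 'soil erosion', 'wheat'}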

    Computer graphics techniques for modeling page turning

    Turning the page is a mechanical part of the cognitive act of reading that we do literally unthinkingly. Interest in realistic book models for digital libraries and other online documents is growing. Yet actually producing a computer graphics implementation for modeling page turning is a challenging undertaking. There are many possible foundations: two-dimensional models that use reflection and rotation; geometrical models using cylinders or cones; mass-spring models that simulate the mechanical properties of paper at varying degrees of fidelity; finite-element models that directly compute the actual forces within a piece of paper. Even the simplest methods are not trivial, and the more sophisticated ones involve detailed physical and mathematical models. The variety, intricacy, and complexity of possible ways of simulating this fundamental act of reading are virtually unknown. This paper surveys computer graphics models for page turning. It combines a tutorial introduction that covers the range of possibilities and complexities with a mathematical synopsis of each model in sufficient detail to serve as a basis for implementation. Illustrations generated by our implementations of each model are included. The techniques presented include geometric methods (both two- and three-dimensional), mass-spring models with varying degrees of accuracy and complexity, and finite-element models. We include a detailed comparison of experimentally determined computation time and subjective visual fidelity for all methods discussed. The simpler techniques support convincing real-time implementations on ordinary workstations.
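
    The simplest three-dimensional geometric model mentioned above reduces to a short computation: wrap the flat page around a cylinder tangent to it at the spine, preserving arc length. The sketch below is illustrative; the radius and coordinates are arbitrary.

        import math

        def bend_around_cylinder(x, y, r):
            """Map a point (x, y) of the flat page into 3-D by wrapping the
            page around a vertical cylinder of radius r tangent at x = 0.
            Arc length is preserved: distance x along the page subtends the
            angle x / r on the cylinder."""
            theta = x / r
            return (r * math.sin(theta),          # horizontal displacement
                    y,                            # unchanged along the spine
                    r * (1.0 - math.cos(theta)))  # lift off the flat surface

        print(bend_around_cylinder(5.0, 2.0, 10.0))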

    Measuring inter-indexer consistency using a thesaurus

    When professional indexers independently assign terms to a given document, the term sets generally differ between indexers. Studies of inter-indexer consistency measure the percentage of matching index terms, but none of them consider the semantic relationships that exist amongst these terms. We propose representing data from multiple indexers in a vector space and using the cosine metric as a new consistency measure that can be extended by semantic relations between index terms. We believe that this new measure is more accurate and realistic than existing ones, and therefore more suitable for evaluating automatically extracted index terms.
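
    The proposed measure reduces to a short computation, sketched below with hypothetical term sets and without the semantic extension that would also credit related terms:

        import math

        def cosine_consistency(terms_a, terms_b):
            """Cosine between binary indicator vectors over the term union."""
            vocab = sorted(terms_a | terms_b)
            a = [1 if t in terms_a else 0 for t in vocab]
            b = [1 if t in terms_b else 0 for t in vocab]
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(a)) * math.sqrt(sum(b))  # binary vectors
            return dot / norm if norm else 0.0

        indexer1 = {"maize", "soil fertility", "crop yields"}
        indexer2 = {"maize", "soil fertility", "fertilizers"}
        print(round(cosine_consistency(indexer1, indexer2), 3))  # -> 0.667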

    Teaching agents to learn: from user study to implementation

    Graphical user interfaces have helped center computer use on viewing and editing, rather than on programming. Yet the need for end-user programming continues to grow. Software developers have responded to the demand with a barrage of customizable applications and operating systems. But the learning curve associated with a high level of customizability, even in GUI-based operating systems, often prevents users from easily modifying their software. Ironically, the question has become, "What is the easiest way for end users to program?" Perhaps the best way to customize a program, given current interface and software design, is for users to annotate tasks, verbally or via the keyboard, as they are executing them. Experiments have shown that users can "teach" a computer most easily by demonstrating a desired behavior. But the teaching approach raises new questions about how the system, as a learning machine, will correlate, generalize, and disambiguate a user's instructions. To understand how best to create a system that can learn, the authors conducted an experiment in which users attempted to train an intelligent agent to edit a bibliography. Armed with the results of these experiments, the authors implemented an interactive machine learning system, which they call Configurable Instructible Machine Architecture (Cima). Designed to acquire behavior concepts from a few examples, Cima keeps users informed and allows them to influence the course of learning. Programming by demonstration reduces boring, repetitive work. Perhaps the most important lesson the authors learned is the value of involving users in the design process. By testing and critiquing their design ideas, users keep the designers focused on their objective: agents that make computer-based work more productive and more enjoyable.
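
    A toy sketch of the kind of generalization such an agent must perform (illustrative only, not Cima's actual algorithm): abstract each demonstrated selection to a run-collapsed sequence of character classes, then use the learned pattern to find similar text elsewhere.

        import re

        def char_class(c):
            """Abstract a character to a class: digit, letter, or the literal itself."""
            if c.isdigit():
                return r"\d"
            if c.isalpha():
                return "[A-Za-z]"
            return re.escape(c)

        def generalize(examples):
            """Learn a regex from a few demonstrations by collapsing runs of
            identical character classes; disagreement means more examples
            (or the user's help) are needed."""
            def pattern(example):
                pieces = []
                for c in example:
                    piece = char_class(c) + "+"
                    if not pieces or pieces[-1] != piece:
                        pieces.append(piece)
                return "".join(pieces)
            patterns = {pattern(e) for e in examples}
            if len(patterns) != 1:
                raise ValueError("demonstrations disagree; ask the user")
            return re.compile(patterns.pop())

        pages = generalize(["12-34", "156-170"])          # demonstrated page ranges
        print(pages.findall("see pp. 5-9 and 210-215"))   # -> ['5-9', '210-215']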